Felipe Ferrão
University of Florida -- Oct 23, 2018
Present intuitive examples on GWAS analysis
Discuss R codes and models
Mixture of theory and hands-on
Molecular Markers (ex: SNPs)
Linkage Disequilibrium (LD)
Linear Regression
Confounding factor (false-positve)
Hypothetical species with 3 chromosomes
Molecular markers along the genome
Quantitative trait ?
What source of observation is known?
What source of observation is unknown?
The idea of associating between markers and genes; and assume that both segregate together across generations refers to the concept of linkage disequilibrium
If marker and gene are in equilibrium, the knowledge of the marker does not present any value to the selection.
LD and physical linkage are not synonymous !!!
LD is influenced by many factors, including selection, mutation, genetic drift, the system of mating, population structure and genetic linkage
Toy example and two questions:
My trait: plant size
12 homozygous individuals (AA or aa)
3 SNPs genotyped in one chr.
Can I use markers to study the phenotypic variation observed?
What marker (SNP1, SNP2, or SNP3) ?
trait = c(50,15,18,68,5,15,15,10,60,15,40,12)
snp1 = as.factor(c("AA","aa", "aa", "AA", "aa", "aa","aa","aa", "AA", "aa", "AA", "aa"))
snp2 = as.factor(c("AA","AA","AA","AA","AA","AA","aa","aa","aa","aa","aa","aa"))
par(mfrow=c(1,2))
plot(trait~snp1, main= "SNP1")
plot(trait~snp2, main="SNP2")
SNP1 may be associated with plant size.
How to formulate our hypothesis?
Linear Regression is a method that summarizes how the average values of an outcome variable (phenotypes) vary in function of the predictor variables (molecular markers).
Ex: height ~ body mass
Derive a estimator that minimize the prediction error variance, which is the expected value of the squared difference between predicted values and their unobserved true values
Intuitively: given that we are trying to predict an outcome using other variables, we want to do so in such a way as to minimize the error of our prediction
\[ y_i = \beta_0 + \beta_1 x_i + e_i \]
\[ \hat{\beta} = (X'X)^{-1}X'y \]
\( H_0:\beta=0 \) \( H_1:\beta \neq 0 \)
summary(lm(trait~as.numeric(snp1)))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.250 6.468433 -4.367364 1.404874e-03
as.numeric(snp1) 41.375 4.573873 9.045945 3.952776e-06
summary(lm(trait~as.numeric(snp2)))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.166667 20.60104 1.0759975 0.3072024
as.numeric(snp2) 3.166667 13.02924 0.2430431 0.8128848
10,000 SNPs for 2,000 individuals
10 true QTLs
File gwasData.rds = SNP genotype (0,1,2)
File map.txt = SNP map position
File phenotypes.txt= phenotypic data
# Read in data - simulated with 10 QTL
gwas=readRDS("gwasData.rds")
pheno=read.table("phenotypes.txt",
header=T,sep="\t")$Pheno
map=read.table("map.txt",header=T,sep="\t")
dim(gwas)
[1] 10000 2000
gwas[1:5,1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 2 0 1
[2,] 1 1 0 1 1
[3,] 1 2 1 2 1
[4,] 0 2 0 1 0
[5,] 1 0 0 0 2
map[1:3,]
snp chrom loc
1 1 1 1
2 2 1 2
3 3 1 3
lm() R function implements the least squares to estimate the intercept and slopeeffect=numeric(10000) # effect sizes (coefficients)
pval=numeric(10000) # p-values
for (i in 1:10000){
res=coef(summary(lm(pheno~gwas[i,])))[2,c(1,4)]
effect[i]=res[1]
pval[i]=res[2]
}
effect[1:4]
[1] -0.4989296 0.1690589 0.1876288 -0.1043797
pval[1:4]
[1] 0.01431413 0.40415900 0.35829852 0.60490297
plot(effect,xlab="single SNP regressions",main="SNP effect estimates",pch=20, cex=3)
abline(v=QTL$indexQTL,col="gray")
points(QTL$indexQTL,QTL$QTLval,col="red",pch=17,cex=2)
legend("topright",c("single SNP","true QTL"),
fil=c("black","red"),cex=0.8)
library(qqman)
library(RColorBrewer)
mypalette<-brewer.pal(9,"Set1")
df = data.frame(SNP=map$snp,CHR=map$chrom,BP=map$loc,P=pval)
manhattan(df,col=mypalette,suggestiveline=F,genomewideline=F,cex=2)
90 students in the classroom
Relationship between hair size and height of people
Statistical Model: hair_size = mean + height + error
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.8703831 3.60057792 10.795596 8.503132e-18
hight -0.1845235 0.02065016 -8.935693 5.568133e-14
We also took notes about sex.
On average, girls have longer hair and are shorter than boys.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.577631689 4.45579195 1.2517711 2.140084e-01
as.factor(cor)red 2.931686390 0.31974835 9.1687303 2.026746e-14
hight -0.002799516 0.02474259 -0.1131456 9.101758e-01
Confounding structures leads to false positive that ends up with spurius association and over estimation of significance of associations
Population structure
Crypitic relateness
Important paper: Overview of techniques to account for confounding due to population stratification and cryptic relatedness in genomic data association analyses. M J Sillanpaa, 2011
What is the solution?
Original model: y=mean + snp + error
Q+K model: y = mean + snp + pop.struc + kinship + error
Important points:
Population structure as covariables (PCA or STRUCTURE software)
Kinship matrix to control unequal familial relationship
Genetic effect is random – \( g \sim N(0,K\sigma^2_ g) \)
Original idea: A unified mixed-model method for association mapping that accounts for multiple levels of relatedness (Yu et al.,2006)
329 oat genotype lines
1962 filtered markers
Yield trait evaluated in four environments
# 1) Kinship matrix (rrBLUP, AGHmatrix, pedigreem and etc)
library(rrBLUP)
Imputation <- A.mat(geno.filtered,impute.method="EM",return.imputed=T,min.MAF=0.05)
K.mat <- Imputation$A ### KINSHIP matrix
K.mat[1:5,1:5]## view Kinship
Oat1 Oat2 Oat3 Oat4 Oat5
Oat1 1.8191561 0.220454719 0.1009423 0.15218203 0.177660657
Oat2 0.2204547 2.111943356 0.1266988 0.02978199 -0.007214392
Oat3 0.1009423 0.126698834 1.8000414 -0.12678093 -0.126589635
Oat4 0.1521820 0.029781989 -0.1267809 1.87520005 1.803271696
Oat5 0.1776607 -0.007214392 -0.1265896 1.80327170 1.873595678
M1 (Simple): y = mean + env + snp + error
M2 (Q model): y = mean + env + snp + PCA + error
M3 (K model): y = mean + env + snp + kinship + error
M4 (Q+K model): y = mean + env + snp + PCA + kinship + error
#2) PCA is computed internally
# pheno = data.frame with phenotypes
# geno = genotype scored as -1,0,1
# K = pedigree matrix
# n.PC = number of principal components to include as fixed effects
library(rrBLUP)
gwasresults<-GWAS(pheno.gwas,geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=NULL,n.PC=0)
gwasresults2<-GWAS(pheno.gwas,geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=NULL,n.PC=6)
gwasresults3<-GWAS(pheno.gwas,geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=K.mat,n.PC=0)
gwasresults4<-GWAS(pheno.gwas,geno.gwas2, fixed=colnames(pheno.gwas)[2:5], K=K.mat,n.PC=6)
QQ-plot: quantile distribution of observed p-values (on the y-axis) on the quantile distribution of expected p-values (on x-axis).
It is is a visual diagnostic tool that can identify the effects of accounted for population structure, relatedness and other issues in GWAS analysis
An example that pouplation structure effects were not strong
Original paper: Genome-wide association mapping of quantitative resistance to sudden death syndrome in soybean (Wen et al., 2014)
Simple Marker Regression vs. Q+K model
Linkage Disequilibrium: Flint-Garcia et al., 2003
Confounding due to population stratification and cryptic relatedness in GWAS: Sillanpaa, 2011
K+Q Model: Yu et al.,2006
Review the remarkable range of discoveries in GWAS: Visscher et al., 2017
Nature Review: Bergelson and Roux, 2010
GWAS in maize:Xiao et al., 2017